25 research outputs found

    Investigating data sharing in speech recognition for an underresourced language: the case of algerian dialect

    The Arabic language has many varieties, including its standard form, Modern Standard Arabic (MSA), and its spoken forms, the dialects. These dialects are representative examples of under-resourced languages for which automatic speech recognition remains an unresolved problem. To address this, we recorded several hours of spoken Algerian dialect and used them to train a baseline model. This model was then improved by exploiting other languages that influence the dialect: their data were merged into one large corpus, and three approaches were investigated: multilingual training, multitask learning and transfer learning. The best performance was achieved using a limited and balanced amount of acoustic data from each additional language, relative to the amount of data available for the studied dialect. This approach led to an improvement of 3.8% in word error rate compared with the baseline system trained only on the dialect data.

    An enhanced automatic speech recognition system for Arabic

    Automatic speech recognition for Arabic is a very challenging task. Although the classical techniques for Automatic Speech Recognition (ASR) can be applied efficiently to Arabic speech, it is essential to take the specificities of the language into consideration to improve system performance. In this article, we focus on Modern Standard Arabic (MSA) speech recognition. We introduce the challenges related to the Arabic language, namely its complex morphology and the absence of short vowels in written text, which leads to several potential, often conflicting, vowelizations for each grapheme. We develop an ASR system for MSA using the Kaldi toolkit. Several acoustic and language models are trained. We obtain a Word Error Rate (WER) of 14.42 for the baseline system and a relative improvement of 12.2 by rescoring the lattice and by rewriting the output with the correct hamza above or below Alif.
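
    Several entries in this listing report results as word error rate (WER) and as absolute or relative improvements. As a rough illustration only, the sketch below computes WER with the standard word-level edit distance and shows how a relative improvement differs from an absolute one. The function names and the whitespace tokenization are assumptions made for this example; they are not taken from any of the listed papers or from Kaldi.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are separated by whitespace; no text normalization is done.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_improvement(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same unit as the inputs."""
    return baseline_wer - new_wer


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Reduction in WER expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer
```

    With illustrative numbers, a drop from a WER of 20.0 to 15.0 is an absolute improvement of 5.0 points but a relative improvement of 0.25 (25%), which is why abstracts usually state which of the two they report.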

    CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube

    This paper addresses the comparability of comments extracted from YouTube. The comments concern spoken Algerian, which may be local Arabic, Modern Standard Arabic or French. This diversity of expression gives rise to a large number of data-processing problems. In this article, several alignment methods are proposed and tested. The method that aligns best is a Word2Vec-based approach applied iteratively. This repeated use of Word2Vec significantly improves the comparability results: a dictionary-based approach yields a Recall of 4, whereas our approach reaches a Recall of 33 at rank 1. Using this approach, we built CALYOU, a comparable corpus of spoken Algerian, from YouTube.

    Machine Translation on a parallel Code-Switched Corpus

    Code-switching (CS) is the phenomenon that occurs when a speaker alternates between two or more languages within an utterance or discourse. In this work, we investigate the existence of code-switching in formal text, namely the proceedings of multilingual institutions. Our study is carried out on Arabic-English code-mixing in a parallel corpus extracted from official documents of the United Nations. We build a parallel code-switched corpus with two reference translations, one in pure Arabic and the other in pure English. We also carry out a human evaluation of this resource, with the aim of using it to evaluate the translation of code-switched documents. To the best of our knowledge, no corpus of this kind exists; the one we propose is unique. This paper examines several methods for translating a code-switched corpus: conventional statistical machine translation, end-to-end neural machine translation and multitask learning.

    Analyse de sentiments des vidéos en dialecte algérien (Sentiment analysis of videos in Algerian dialect)

    Most existing work on sentiment analysis deals with Modern Standard Arabic and does not take into account the specificities of dialectal Arabic. This article presents a sentiment analysis system for text extracted from videos spoken in Algerian dialect. In this work, we face two challenges: automatic speech recognition for Algerian dialect and sentiment analysis of the recognized text. The automatic speech recognition system is developed from a small spoken corpus. To compensate for the lack of data, we propose exploiting data from languages that influence Algerian dialect, namely Standard Arabic and French. Sentiment analysis is based on automatically detecting the polarity of words according to their semantic proximity to other words whose polarity is predetermined.

    A new language model based on possibility theory

    Official publication date not yet available. Language modeling is a very important step in several NLP applications. Most current language models are based on probabilistic methods. In this paper, we propose a new language modeling approach based on possibility theory. Our goal is to suggest a method for estimating the possibility of a word sequence and to test this new approach in a machine translation system. We propose a possibilistic measure for word sequences, which can be estimated from a corpus. We proceeded in two ways: first, we checked the behaviour of the new approach compared with existing work. Second, we compared the new language model with the probabilistic one used in statistical MT systems. The results, in terms of the METEOR metric, show that the possibilistic language model is better than the probabilistic one. However, in terms of BLEU and TER scores, the probabilistic model remains better.

    Predicting and Critiquing Machine Virtuosity: Mawwal Accompaniment as Case Study

    The evaluation of machine virtuosity is critical to improving the quality of virtual instruments, and may also help predict future impact. In this contribution, we evaluate and predict the virtuosity of a statistical machine translation model that provides an automatic responsive accompaniment to mawwal, a genre of Arab vocal improvisation. As an objective evaluation used in natural language processing (BLEU score) did not adequately assess the model's output, we focused on subjective evaluation. First, we culturally locate virtuosity within the particular Arab context of tarab, or modal ecstasy. We then analyze listening test evaluations, which suggest that the corpus size needs to increase to 18K for machine and human accompaniment to be comparable. We also posit that the relationship between quality and inter-evaluator disagreement follows a higher-order polynomial function. Finally, we gather suggestions from a musician in a user experience study for improving machine-induced tarab. We were able to infer that the machine's lack of integration into tarab may be due, in part, to its dependence on a tri-gram language model, and instead suggest using a four- or five-gram model. In the conclusion, we note the limitations of language models for music translation.

    Development of the Arabic Loria Automatic Speech Recognition system (ALASR) and its evaluation for Algerian dialect

    This paper addresses the development of an Automatic Speech Recognition system for Modern Standard Arabic (MSA) and its extension to Algerian dialect. Algerian dialect is very different from the Arabic dialects of the Middle East, since it is highly influenced by the French language. In this article, we start by presenting the new automatic speech recognition system named ALASR (Arabic Loria Automatic Speech Recognition). The acoustic model of ALASR is based on a DNN approach and the language model is a classical n-gram. Several options are investigated in this paper to find the best combination of models and parameters. ALASR achieves good results for MSA in terms of WER (14.02%), but it completely collapses on a 70-minute Algerian dialect data set (a WER of 89%). To take into account the impact of the French language on the Algerian dialect, we combine two acoustic models in ALASR: the original one (MSA) and a French one trained on the ESTER corpus. This solution was adopted because no transcribed speech data for Algerian dialect are available. This combination leads to a substantial absolute reduction of the word error rate of 24%. © 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the 3rd International Conference on Arabic Computational Linguistics.

    Adaptation of speech recognition vocabularies for improved transcription of YouTube videos

    This paper discusses the adaptation of speech recognition vocabularies for automatic speech transcription. The context is the transcription of YouTube videos in French, English and Arabic. Baseline automatic speech recognition systems have been developed using previously available data. However, the available text data, including the GigaWord corpora from LDC, are becoming quite old relative to the recent YouTube videos that are to be transcribed. After a discussion of the performance of the baseline ASR systems, the paper presents the collection of recent textual data from the internet for updating the speech recognition vocabularies and for training the language models, as well as the preparation of the development data sets needed for the vocabulary selection process. The paper also compares the coverage of the training data collected from the internet and of the GigaWord data, using finite-size vocabularies made of the most frequent words. Finally, the paper presents and discusses the number of out-of-vocabulary word occurrences, before and after the update of the speech recognition vocabularies, for the three languages. Some speech recognition evaluation results are also provided and analyzed.
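
    The abstract above describes selecting a finite-size vocabulary from the most frequent words and counting out-of-vocabulary (OOV) occurrences on development data. A minimal sketch of that measurement is given below, assuming pre-tokenized text and a plain frequency cutoff; the function names and the toy data are hypothetical, and the paper's actual vocabulary selection process may differ.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocabulary(training_tokens: Iterable[str], size: int) -> Set[str]:
    """Keep the `size` most frequent word types from the training text.

    Assumption: a simple frequency cutoff stands in for the vocabulary
    selection process mentioned in the abstract.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(vocabulary: Set[str], dev_tokens: List[str]) -> float:
    """Fraction of development-set word occurrences not covered by the vocabulary."""
    if not dev_tokens:
        raise ValueError("development set must contain at least one token")
    missing = sum(1 for token in dev_tokens if token not in vocabulary)
    return missing / len(dev_tokens)


# Toy data, purely illustrative.
train = "the cat sat on the mat the cat".split()
dev = "the dog sat on the cat".split()

vocab = build_vocabulary(train, size=2)  # the two most frequent word types
rate = oov_rate(vocab, dev)              # share of dev tokens outside the vocabulary
```

    Running the same measurement with vocabularies built from older and from more recent text is one simple way to compare how well each source covers newly collected data.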